MOMENTS IN STATISTICS:
APPROXIMATIONS TO DENSITIES AND
GOODNESS-OF-FIT

Michael  A.  Stephens 


Summary 

In  this  article  we  discuss  ways  in  which  moments  are  used  (a)  to  approximate 
distributions,  and  (b)  to  test  fit  to  a  given  distribution. 

1  Approximating  distributions  using  moments 

Solomon and Stephens (1977) give a number of examples of statistics X for
which the first few, or even all, the moments or cumulants may be found, but
whose density f(x) and distribution F(x), assumed continuous, are intractable.
A good example is the statistic S, which is distributed as a weighted sum of
independent chi-square variables, each with one degree of freedom, written

$$S = \sum_{i=1}^{k} \lambda_i (u_i)^2 \qquad (1)$$

where $u_i$ are i.i.d. N(0, 1), and $\lambda_i$ are known weights. Many quantities in
statistics have distributions (often asymptotic distributions) like S; for example, the
Pearson $X^2$ statistic, used in testing fit to a distribution when the distribution
tested contains unknown parameters which are estimated by maximising the
usual likelihood, rather than the multinomial likelihood, has this distribution
with some $\lambda_i \neq 1$. Other goodness-of-fit statistics, of Cramér-von Mises type,
based on the empirical distribution function (EDF), also have such asymptotic
distributions (see the many examples in Stephens, 1986a).

One of the first examples of S to be tabulated, for k = 2, involved errors in
target hitting during World War 2: tables for S were produced with some labour
by Grad and Solomon (1955) using analytic methods. These have been extended
by various authors to higher values of k, but the analysis after k = 5 or 6 rapidly
becomes very difficult. Thus in general it is difficult to find exact percentage
points of S, but the cumulants $\kappa_r$, r = 1, 2, ..., are very easily obtained:

$$\kappa_r = 2^{r-1}(r-1)! \sum_{i=1}^{k} \lambda_i^r \qquad (2)$$
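As a minimal sketch of how cheaply these cumulants can be computed, the following Python function evaluates $\kappa_r = 2^{r-1}(r-1)!\sum_i \lambda_i^r$ for a list of weights. The function name and the example weights are illustrative assumptions, not part of the original paper.

```python
from math import factorial


def cumulants_weighted_chisq(weights, r_max=4):
    """Return [kappa_1, ..., kappa_r_max] for S = sum_i lambda_i * u_i**2.

    `weights` holds the known lambda_i; the u_i are i.i.d. N(0, 1).
    """
    return [
        2 ** (r - 1) * factorial(r - 1) * sum(lam ** r for lam in weights)
        for r in range(1, r_max + 1)
    ]


# Hypothetical weights, chosen only for illustration.
example_weights = [1.0, 0.25, 1.0 / 9.0]
example_cumulants = cumulants_weighted_chisq(example_weights)
```

With a single unit weight the function returns the cumulants of a chi-square variable with one degree of freedom, which is a convenient check.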


2  Moments  and  cumulants 


In this section we list definitions. The r-th moment about the origin of a random
variable X, or equivalently of its distribution f(x), will be called $\mu'_r$; the r-th
moment about the mean will be $\mu_r$. The moment generating function $M_X(t)$ of
X is defined by

$$M_X(t) = \int_{-\infty}^{\infty} e^{tx} f(x)\,dx; \qquad (3)$$

when expanded as a Taylor series,

$$M_X(t) = 1 + \mu t + \frac{\mu'_2 t^2}{2!} + \frac{\mu'_3 t^3}{3!} + \cdots + \frac{\mu'_r t^r}{r!} + \cdots \qquad (4)$$

where $\mu = \mu'_1$ is the mean of X.

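Cumulants $\kappa_r$ are defined through the cumulant generating function $C_X(t) =
\log M_X(t)$, where “log” refers to natural logarithm. Then

$$C_X(t) = \kappa_1 t + \frac{\kappa_2 t^2}{2!} + \frac{\kappa_3 t^3}{3!} + \cdots + \frac{\kappa_r t^r}{r!} + \cdots \qquad (5)$$

Thus in principle we must find $M_X(t)$ before finding $C_X(t)$.

The following relationships exist between low-order moments and cumulants:
$\kappa_1 = \mu'_1 = \mu$; $\kappa_2 = \mu_2 = \sigma^2$; $\kappa_3 = \mu_3$; $\kappa_4 = \mu_4 - 3\mu_2^2$. Further relationships may
be found in Kendall and Stuart (1977, vol. 1).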
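The low-order relationships between cumulants and central moments can be coded directly. The sketch below converts in both directions for orders up to four only; the function names are illustrative assumptions.

```python
def cumulants_to_central_moments(k1, k2, k3, k4):
    """Return (mean, mu2, mu3, mu4) from the first four cumulants."""
    return k1, k2, k3, k4 + 3 * k2 ** 2


def central_moments_to_cumulants(mean, mu2, mu3, mu4):
    """Return (kappa1, kappa2, kappa3, kappa4) from the mean and central moments."""
    return mean, mu2, mu3, mu4 - 3 * mu2 ** 2
```

For example, the cumulants 1, 2, 8, 48 of a chi-square variable with one degree of freedom give a fourth central moment of 60.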

Suppose $Z = X_1 + X_2 + X_3 + \cdots + X_k$ where $X_i$ are independent random
variables. Then a property of moment generating functions is

$$M_Z(t) = M_{X_1}(t) \cdots M_{X_k}(t),$$

so that

$$C_Z(t) = C_{X_1}(t) + C_{X_2}(t) + \cdots + C_{X_k}(t), \qquad (6)$$

and it quickly follows, using obvious notation, that

$$\kappa_r(Z) = \kappa_r(X_1) + \kappa_r(X_2) + \cdots + \kappa_r(X_k). \qquad (7)$$

This additive property makes it very easy to find cumulants of sums of independent
random variables, and hence, for example, the cumulants of S.

Two important $M_X(t)$ are those of the $N(\mu, \sigma^2)$ distribution, $M_X(t) =
\exp(\mu t + \sigma^2 t^2/2)$, and the $\chi^2_p$ distribution, $M_X(t) = 1/(1 - 2t)^{p/2}$. Finally,
it is easily shown that $\mu_r(aX + b) = a^r \mu_r(X)$, for $r \geq 2$, where a and b are any
real constants, and $\kappa_r(aX + b) = a^r \kappa_r(X)$, $r \geq 2$.

As an example, consider S. If X has a $\chi^2_1$ distribution, the MGF of X
is $1/(1 - 2t)^{1/2}$; thus $C_X(t) = -\frac{1}{2}\log(1 - 2t)$, and expansion gives

$$C_X(t) = t + \frac{2t^2}{2!} + \frac{8t^3}{3!} + \frac{48t^4}{4!} + \cdots.$$

Thus the r-th cumulant of X is $\kappa_r = 2^{r-1}(r-1)!$,
that of $\lambda_i X$ is $\lambda_i^r \kappa_r$, and by the additive property (7), the r-th cumulant of S
is given by the expression (2).

3  Mathematical  approximations 

The  approximations  in  this  section  are  called  “mathematical”  because  they  are 
based on mathematical analysis, with known properties of accuracy and convergence,
in contrast to those to be considered later.

Suppose n(t) is the standard normal density

$$n(t) = e^{-t^2/2}/\sqrt{2\pi} \qquad (8)$$

and let f(x) be the (continuous) density of X. Then it is (nearly always) possible
to expand f(x) as

$$f(x) = n(x)\left\{1 + \tfrac{1}{2}(\mu_2 - 1)H_2(x) + \tfrac{1}{6}\mu_3 H_3(x) + \tfrac{1}{24}(\mu_4 - 6\mu_2 + 3)H_4(x) + \cdots\right\} \qquad (9)$$

called a Gram-Charlier series. The $H_r(x)$ are Hermite polynomials. Lists of
Hermite polynomials, and also conditions for convergence, etc., are given in
Kendall and Stuart (1977, vol. 1).

The basic technique involved in deriving (9) rests on the fact that Hermite
polynomials are orthogonal with respect to the kernel n(x); thus

$$\int_{-\infty}^{\infty} H_i(x)H_j(x)n(x)\,dx = \begin{cases} 0, & i \neq j, \\ j!, & i = j. \end{cases} \qquad (10)$$

Then if $f(x) = \sum_i c_i\, n(x) H_i(x)$, multiplication by $H_j(x)$ on both sides, and
integration, gives

$$c_j = \int_{-\infty}^{\infty} f(x) H_j(x)\,dx \,/\, j!.$$

When worked out, $c_2 = (\mu_2 - 1)/2$, $c_3 = \mu_3/6$, etc.

If an infinite set of moments is available, as for S, the density can be approximated
very accurately using a Gram-Charlier series of sufficient length, but
there are many statistics in practical applications for which it is difficult even
to get the first four moments; see Solomon and Stephens (1977) for examples.
There are two other important drawbacks:

1. A k-term fit might, at any one value of x, be worse than a (k − 1)-term
fit.

2. Gram-Charlier series with finite numbers of moments can give a negative
density f(x), particularly in the tails.
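The second drawback is easy to reproduce numerically. The sketch below evaluates the Gram-Charlier series truncated after the $H_4$ term, assuming that X has mean zero so that the expansion has no $H_1$ term. The function names are illustrative, and the example moments are those of a standardised chi-square variable with one degree of freedom ($\mu_2 = 1$, $\mu_3 = \sqrt{8}$, $\mu_4 = 15$).

```python
from math import exp, pi, sqrt


def normal_density(x):
    """Standard normal density n(x)."""
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)


def gram_charlier_density(x, mu2, mu3, mu4):
    """Four-moment Gram-Charlier approximation for a mean-zero variable."""
    h2 = x * x - 1.0
    h3 = x ** 3 - 3.0 * x
    h4 = x ** 4 - 6.0 * x * x + 3.0
    series = (
        1.0
        + 0.5 * (mu2 - 1.0) * h2
        + (mu3 / 6.0) * h3
        + ((mu4 - 6.0 * mu2 + 3.0) / 24.0) * h4
    )
    return normal_density(x) * series


# Standardised chi-square(1) moments; the fitted "density" is negative at x = -2.
value_in_lower_tail = gram_charlier_density(-2.0, 1.0, sqrt(8.0), 15.0)
```

When the moments are those of the standard normal itself (1, 0, 3), every correction term vanishes and the function returns n(x).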

3.1  Percentage  points  approximation 

A Gram-Charlier-type expansion can also be found for F(x), the distribution
function of X; this can be inverted to give a percentage point for a given cumulative
area $\alpha$. Thus suppose $F(x_\alpha) = \alpha$; we want an approximation to $x_\alpha$. A
Cornish-Fisher expansion gives $x_\alpha - \xi$ as a series in Hermite polynomials in
$x_\alpha$, or (more practically useful) in $\xi$, where $\xi$ is the percentile corresponding to
$\alpha$ for the normal distribution, that is, $\xi$ is the solution of

$$\int_{-\infty}^{\xi} n(x)\,dx = \alpha. \qquad (11)$$

Again, problems can arise with the convergence to the desired $x_\alpha$. For more
details on mathematical expansions of Gram-Charlier or Cornish-Fisher type,
see Kendall and Stuart (1977, vol. 1).
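To make the idea concrete, here is a minimal sketch of the familiar low-order Cornish-Fisher approximation for a standardised variable (mean 0, variance 1) with third and fourth cumulants $\kappa_3$ and $\kappa_4$. The specific truncation used, $x_\alpha \approx \xi + (\xi^2 - 1)\kappa_3/6 + (\xi^3 - 3\xi)\kappa_4/24 - (2\xi^3 - 5\xi)\kappa_3^2/36$, is the standard textbook form and is an assumption here; the paper itself does not write the series out. The function name is illustrative.

```python
from statistics import NormalDist


def cornish_fisher_point(alpha, kappa3, kappa4):
    """Approximate the alpha-quantile of a standardised variable.

    alpha is the cumulative area below the point; kappa3 and kappa4 are
    the third and fourth cumulants of the standardised variable.
    """
    xi = NormalDist().inv_cdf(alpha)  # normal percentile solving (11)
    return (
        xi
        + (xi ** 2 - 1.0) * kappa3 / 6.0
        + (xi ** 3 - 3.0 * xi) * kappa4 / 24.0
        - (2.0 * xi ** 3 - 5.0 * xi) * kappa3 ** 2 / 36.0
    )
```

With both cumulants set to zero the function returns the normal percentile itself; a positive $\kappa_3$ moves the upper 5% point to the right, as one would expect for a distribution with a long right tail.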


4  Pearson  curves  and  other  systems 


We  now  turn  to  a  method  of  approximation  which  can  be  thought  of  as  “laying 
one  curve  upon  another”  —  the  approximating  curve  has  parameters  which  can 
be  varied  to  make  a  good  fit.  The  parameters  are  usually  chosen  by  matching 
moments  or  cumulants.  Percentage  points  of  the  approximating  curve,  which 
are  tabulated  or  otherwise  easily  found,  are  then  used  as  approximations  to  the 
desired  points. 

A family of approximating curves is the Pearson system, where the (continuous)
density f(x) is approximated by f*(x), given by

$$\frac{1}{f^*(x)}\frac{df^*(x)}{dx} = \frac{a + x}{b_0 + b_1 x + b_2 x^2}. \qquad (12)$$

According to the values of the constants $a, b_0, b_1, b_2$, integration of the right-hand
side will take many forms, giving great flexibility to the system of densities
f*(x). With considerable algebra (see Elderton and Johnson, 1969, for details),
the constants may be put in terms of the moments. Suppose

$$A = 10\mu_4\mu_2 - 18\mu_2^3 - 12\mu_3^2; \qquad (13)$$

then

$$a = \frac{\mu_3(\mu_4 + 3\mu_2^2)}{A}, \qquad (14)$$

$$b_0 = \frac{-\mu_2(4\mu_2\mu_4 - 3\mu_3^2)}{A}, \qquad (15)$$

$$b_1 = -a, \qquad (16)$$

$$b_2 = \frac{-(2\mu_2\mu_4 - 3\mu_3^2 - 6\mu_2^3)}{A}. \qquad (17)$$

Thus knowledge of the first four moments or cumulants of X will fix the constants
above: a further constant C enters on integrating, but is fixed by the fact
that the total integral of f*(x) must be 1.
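A short sketch of this step follows, computing the four Pearson constants from the central moments $\mu_2$, $\mu_3$, $\mu_4$. It assumes that x in the differential equation is measured from the mean, and that the equation has the form $(1/f^*)\,df^*/dx = (a + x)/(b_0 + b_1 x + b_2 x^2)$; the function name is illustrative.

```python
def pearson_constants(mu2, mu3, mu4):
    """Return (a, b0, b1, b2) for the Pearson curve with these central moments."""
    big_a = 10.0 * mu4 * mu2 - 18.0 * mu2 ** 3 - 12.0 * mu3 ** 2
    if big_a == 0:
        raise ValueError("A = 0: the constants are not defined for these moments")
    a = mu3 * (mu4 + 3.0 * mu2 ** 2) / big_a
    b0 = -mu2 * (4.0 * mu2 * mu4 - 3.0 * mu3 ** 2) / big_a
    b1 = -a
    b2 = -(2.0 * mu2 * mu4 - 3.0 * mu3 ** 2 - 6.0 * mu2 ** 3) / big_a
    return a, b0, b1, b2
```

Two checks are reassuring. The standard normal moments (1, 0, 3) give a = 0, $b_0 = -1$, $b_1 = b_2 = 0$, so that $df^*/dx = -x f^*$. The chi-square moments with one degree of freedom (2, 8, 60) give a = 2, $b_0 = b_1 = -2$, $b_2 = 0$, which reproduces the logarithmic derivative of that density when x is measured from its mean of 1.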

4.1  Percentage  points 

When the constants are known, the density f*(x) may be integrated and percentage
points solved for numerically. Over the years, this was done, at first
very laboriously, for a small range of possibilities, but a quite extensive tabulation
was made, using electronic computers, in the late ’60s. These tables
are in Biometrika Tables for Statisticians, vol. II. The form of the tables is
as follows. The percentage points for X, the standardised x-variable given by
$X = (x - \mu)/\sigma$, are given in a two-way table indexed by the skewness and
kurtosis parameters $\beta_1$ and $\beta_2$. These are defined by

$$\beta_1 = \frac{\mu_3^2}{\mu_2^3} \quad \text{and} \quad \beta_2 = \frac{\mu_4}{\mu_2^2}; \qquad (18)$$

they have been defined to be scale-free, and $\sqrt{\beta_1}$ takes the sign of $\mu_3$. $\beta_1$
measures skewness: a large (positive) $\sqrt{\beta_1}$ means the curve is skewed towards
positive values (long tail is to the right) and vice versa for negative $\sqrt{\beta_1}$. A
large $\beta_2$ (always positive) means the density has heavy tails. Of course, all
symmetric distributions have $\beta_1 = 0$; a benchmark to measure kurtosis is the
normal distribution for which $\beta_2 = 3$. Since $\kappa_4 = \mu_4 - 3\mu_2^2$, the parameter
$\gamma_2 = \beta_2 - 3 = \kappa_4/\mu_2^2$ can also be regarded as measuring kurtosis, with value
$\gamma_2 = 0$ for the normal distribution.
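The skewness and kurtosis parameters used to index the tables can be sketched in a few lines. The function name is an illustrative assumption; the inputs are the central moments $\mu_2$, $\mu_3$, $\mu_4$.

```python
from math import copysign, sqrt


def skewness_kurtosis(mu2, mu3, mu4):
    """Return (sqrt_beta1, beta1, beta2, gamma2) from central moments.

    sqrt_beta1 carries the sign of mu3, following the convention in the text.
    """
    beta1 = mu3 ** 2 / mu2 ** 3
    beta2 = mu4 / mu2 ** 2
    sqrt_beta1 = copysign(sqrt(beta1), mu3)
    return sqrt_beta1, beta1, beta2, beta2 - 3.0
```

For a chi-square variable with one degree of freedom (central moments 2, 8, 60) this gives $\beta_1 = 8$, $\beta_2 = 15$ and $\gamma_2 = 12$, a strongly skewed, heavy-tailed case.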

Suppose, for a given S, we have $\sqrt{\beta_1} = 0.8$ and $\beta_2 = 4.6$. To use Biometrika
Tables, one enters the appropriate $\sqrt{\beta_1}$ table, $\sqrt{\beta_1} = 0.8$, and travels down
the left-hand column until the $\beta_2$ value, 4.6, is reached. Along the row are 17
tabulated percentage points for X, from $\alpha = 0.00$ to $\alpha = 1.00$. Interpolation
must be used for $\beta_2$ values not explicitly given.

4.2  Un  peu  d’histoire 

At  this  point,  perhaps,  it  might  be  permitted  to  enliven  the  account  with  what 
the  Guide  Michelin  calls  un  peu  d’histoire.  At  the  time  Biometrika  Tables  Vol. 
II were being prepared, I was fortunate enough to know Professor E. S. Pearson,
then retired but still very active, especially as Editor of Biometrika. He
had collaborated with workers in the U. S. to get the tables (Johnson, Nixon,
Amos and Pearson, 1963) and had carefully compiled the full set by hand. He
had  introduced  me  to  Pearson  curves,  which,  to  put  it  mildly,  did  not  figure 
prominently  in  statistical  training  of  the  day,  and  had  shown  me  how  effective 
they  could  be.  He  gave  me  a  copy  of  the  tables  to  use.  I  undertook  to  write 
a  Fortran  program  on  the  IBM  650,  to  interpolate  and  find  points,  given  the 
first  four  moments.  All  20  tables  were  then  typed  onto  punched  cards;  in  the 
end,  I  got  it  down  to  approximately  45  minutes  per  table.  This  is  not  such  a 
dramatic piece of history as Michelin usually provides (assignations and assassinations
often play a prominent role), but a diminishing generation of modern
readers  will  still  empathise  with  the  fears  of  losing  the  boxes  of  cards,  getting 
them  wet  in  the  snows  of  Montreal,  etc.,  not  to  mention  the  awful  discovery  of 
a  wrongly-typed  number! 

Since  then,  programs  have  been  written  to  integrate  the  density  equation 
for f*(x) numerically and to solve for $x_\alpha$ for given $\alpha$, or to provide the tail
area for given x; one of these, kindly given to me by Amos and Daniel (1971),
has been added to my program; this greatly increases the range of $\beta_1$ and $\beta_2$
for which Pearson curve approximations can be found. However, points are
still  output  from  both  the  Amos  and  Daniel  part  of  the  program  and  by  the 
Biometrika  Tables  part,  ostensibly  as  a  check  where  available,  but  truthfully  as 
a  sentimental  tribute  to  E.  S.  P. 

Later  on,  Charles  Davis  and  I  (Davis  and  Stephens,  1983)  added  to  the 
program  to  enable  a  fit  to  be  made  using  knowledge  of  an  end  point  (for  example, 
that  the  left-hand  endpoint  of  S  is  zero)  and  three  moments.  This  is  especially 
valuable  for  the  type  of  statistic  for  which  each  successive  moment  requires 
exponentially  increasing  hard  work  —  for  example,  the  distribution  of  areas,  or 
perimeters,  of  polygons  formed  by  randomly  dropping  lines  on  a  plane  —  see 
Solomon and Stephens (1977). The Pearson-curve fitting program is available
from  the  author. 

Further  developments  have  included  algorithms  to  facilitate  use  of  Pearson 
curves  —  see,  for  example,  Bowman  and  Shenton  (1979a,  1979b). 

4.3  Accuracy  of  Pearson  curve  fits 

(a) Pearson curve densities are unimodal, or possibly J- or U-shaped, but never
multimodal. They are also never negative.

(b) Percentage points or tail areas found from Pearson curve fitting have been
found, for unimodal long-tailed distributions, to be very accurate in the
long tail, at least for tail areas bigger than 0.005, or the 0.5% point.
Pearson and Tukey (1965) discuss this issue; Solomon and Stephens (1977)
give comparisons. (In making comparisons, one must of course compare
the Pearson curve fit with the correct $x_\alpha$, or the correct area for given x,
for a distribution which is not itself a member of the Pearson family.)

(c) Davis (1975) has made extensive comparisons with Gram-Charlier fits using
only four moments. Pearson curve fits are better than Gram-Charlier fits
everywhere except for distributions very close to the normal, as measured
by the $\beta_1$, $\beta_2$ values.

4.4  Other  systems 

Johnson (1949) has proposed another family (divided into three parts) of curves
defined by four moments: for example, the $S_U$ curves are those given by the
relation

$$\xi = \gamma + \delta \sinh^{-1} X \qquad (19)$$

where $X = (x - \mu)/\sigma$, and $\gamma, \delta$ are to be chosen to make the distribution of
$\xi$ as close as possible to N(0, 1). A discussion, and tables to facilitate the
calculation of $\gamma$ and $\delta$, are in Biometrika Tables for Statisticians Vol. II. Other
authors have also proposed families of distributions, but they have not come
into such common use for the purpose of approximating percentage points.


5  Use  of  higher  moments 

We now turn to the first of two interesting questions: can higher moments
be used to improve the accuracy of Pearson curve fits in the long tail of the
distribution? The long tail will be supposed to lie to the right, as for the
distribution of S; then, since higher values of x will contribute more to the
higher moments than smaller values, we might suppose that fits using higher
moments will improve accuracy. Unfortunately it is not easy to establish the
four constants in terms of higher moments; of course, only four of these would
be needed to fix the constants. A recursion formula exists to generate higher
moments, for r = 2, 3, ...:

$$r b_0 \mu_{r-1} + \{(r+1)b_1 + a\}\mu_r + \{(r+2)b_2 + 1\}\mu_{r+1} = 0. \qquad (20)$$

In this recursion, the constants $a$, $b_0$, $b_1$ and $b_2$ occur, and this means that one
cannot reverse the recursion and generate, say, $\mu$ and $\sigma^2$ from $\mu_5$ and $\mu_6$.
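The recursion can be run forwards once the four constants are known. The following is a minimal sketch that solves $r b_0 \mu_{r-1} + \{(r+1)b_1 + a\}\mu_r + \{(r+2)b_2 + 1\}\mu_{r+1} = 0$ for $\mu_{r+1}$ at each step; the function name and argument layout are illustrative assumptions, and the moments are central moments with $\mu_0 = 1$ and $\mu_1 = 0$.

```python
def extend_pearson_moments(a, b0, b1, b2, central_moments, up_to):
    """Extend [mu_0, mu_1, ..., mu_m] to order `up_to` using the recursion.

    central_moments must start with mu_0 = 1 and mu_1 = 0 and contain at
    least mu_2, so that each step has the two previous moments available.
    """
    mu = list(central_moments)
    while len(mu) <= up_to:
        r = len(mu) - 1  # highest order currently known
        coeff = (r + 2) * b2 + 1.0
        if coeff == 0:
            raise ValueError("moment of order %d does not exist" % (r + 1))
        mu.append(-(r * b0 * mu[r - 1] + ((r + 1) * b1 + a) * mu[r]) / coeff)
    return mu


# Chi-square(1): constants a = 2, b0 = b1 = -2, b2 = 0; moments 1, 0, 2, 8, 60.
chisq_moments = extend_pearson_moments(2.0, -2.0, -2.0, 0.0, [1.0, 0.0, 2.0, 8.0, 60.0], 6)
```

Because the chi-square distribution is itself a Pearson curve, the generated fifth and sixth moments (544 and 6040) agree with the exact values obtained from its cumulants. For a statistic outside the Pearson family they would differ, which is the comparison described next.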

Nevertheless, one can generate the fifth and sixth moments of the Pearson
curve with the same first four moments as, say, S, and compare them with the
true fifth and sixth moments of S. The first two moments are then slightly
changed,  and  the  procedure  successively  repeated,  until  the  third,  fourth,  fifth 
and  sixth  moments  of  each  curve  match.  This  will  mean  that  the  mean  and 
variance  of  the  Pearson  curve  will  not  be  exactly  the  same  as  those  for  S, 
although  they  will  be  close,  and  this  will  probably  make  a  worse  fit  in  the  lower 
tail;  but  for  higher  x  the  fit  could  improve.  I  have  made  some  comparisons  using 
this  procedure,  but,  as  one  might  expect,  there  appears  to  be  no  systematic 
improvement.  In  discussion,  when  this  paper  was  first  presented,  the  suggestion 
was made to use Least Squares to make “closest” fits, in order to compare the
six  moments.  More  work  is  needed  to  compare  Pearson  curve  fits  along  these 
various  lines,  but  it  is  not  likely  that  the  improvement  will  be  sure,  or  will 
extend  to  points  far  into  the  tails.  In  the  end  it  must  be  remembered  that  one 
curve  is  simply  being  laid  on  top  of  another,  with  only  four  parameters  to  vary, 
and  there  is  no  mathematical  analysis  that  will  guarantee  accuracy. 

Other methods for developing accuracy in the extreme tails include numerical
inversion of the Characteristic Function (essentially the MGF with $it$ replacing $t$,
where $i = \sqrt{-1}$), or saddlepoint approximations. A method due to Imhof (1961)
uses numerical inversion for distributions such as S, but the computer time
needed increases rapidly as the distance into the tails increases (to give small
tail areas). Field (1992) has recently examined saddlepoint approximations for
S. These would seem to give more promise of tail-end accuracy in the long run.

6  Use  of  sample  moments 

The  second  interesting  question  is:  how  accurate  are  Pearson  curve  fits  when 
sample  moments  are  used  to  make  the  fit?  In  the  earliest  days,  this  was  the  use 
to  which  Pearson  curves  were  applied  —  to  find  a  smooth  density  to  describe 
a set of data, such as lengths of beans, or width of skulls. Kendall and Stuart
(1977, Vol. 1) give details of such a fit. In general, the Pearson curves will give
very  good  fits  to  a  unimodal  set  of  data,  or  even  to  J-shaped  or  U-shaped  sets, 
but  it  is  important  to  assess  the  accuracy  of  extrapolation  from  the  sample  to 
the  supposed  population  from  which  it  came.  More  precisely,  we  ask  how  close 
the  sample  fit  estimate  of,  say,  the  upper-tail  5%  point  is  to  the  true  population 
5%  point,  and,  further,  whether  or  not  the  Pearson-curve  point  is  better  than 
the  estimated  point  derived  from  choosing  the  appropriate  order  statistic  —  in  a 
sample of 1000, the 951st value in ascending order, or in a sample of size 10000,
the 9501st value. Some investigation of these questions has been undertaken in
two  quite  different  ways,  by  Johnstone  (1988)  and  by  myself  (Stephens,  1991). 
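For comparison with a Pearson-curve point, the order statistic estimate is simple to compute. The sketch below assumes the rule "take the value of rank round(n(1 − α)) + 1 in ascending order", which reproduces the two examples above (rank 951 for n = 1000 and rank 9501 for n = 10000 at the upper 5% point); other conventions for choosing the rank exist, and the function name is illustrative.

```python
def order_statistic_point(sample, upper_tail_area):
    """Estimate the upper-tail percentage point by an order statistic.

    The rank used is round(n * (1 - upper_tail_area)) + 1, capped at n.
    """
    ordered = sorted(sample)
    n = len(ordered)
    rank = min(round(n * (1.0 - upper_tail_area)) + 1, n)
    return ordered[rank - 1]
```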

The  accuracy  of  the  Pearson  curve  point  will  depend  on: 

1.  the  sample  size  n, 

2.  the  a-level  (tail  area)  of  the  point  required, 

3.  the  true  skewness  and  kurtosis  of  the  density  approximated, 

4.  higher  moments. 

Johnstone  gives  a  small  study,  for  samples  from  populations  with  the  following 
range  of  parameters: 


[Table of parameter ranges not recoverable from the source; surviving fragment: $\beta_1$ ... 1.0]

Johnstone gives plots of the estimated coefficient of variation, CV, of the
Pearson curve $x_\alpha$ against $-\log \alpha$, where the base of logarithms is 10. Thus the
CV of the estimated $x_{0.01}$ is plotted against 2, that of the estimated $x_{0.001}$ is
plotted against 3, etc. The coefficient of variation is estimated using a Taylor
series approximation. As one might expect, the CV goes up markedly as $\alpha$ gets
smaller (so $-\log \alpha$ gets larger on the x-axis), and the steepness of the rise is
greater for the more skew distributions.

In Stephens (1991), Monte Carlo samples were taken from populations for
which exact percentage points could be found, and the exact points were compared
with those obtained from (a) Pearson curve fits using the moments of
each sample, and (b) the order statistic estimate from each sample. The order
statistic estimate will be asymptotically unbiased, while one can say nothing
exact about the point obtained by laying one curve on another; recall that sample
moments, especially the third and fourth, are extremely variable, even for
quite  large  samples.  The  results  showed,  as  expected,  that  the  Pearson  curve 
points  were  more  biased.  However,  somewhat  surprisingly,  they  had  smaller 
mean  square  error.  Therefore,  it  might  well  be  preferable  to  use  the  Pearson 
curve  points,  although,  again,  more  investigations  should  be  made  especially  if 
the  points  required  are  far  into  the  tail. 


7  Goodness  of  fit  using  moments 

In  this  second  part  of  the  paper,  we  discuss  how  moments  are  used  in  Goodness- 
of-Fit,  that  is,  to  test  whether  a  random  sample  comes  from  a  given  (continuous) 
distribution.  The  distribution  will  often  have  unknown  parameters,  which  must 
be  estimated  from  the  given  sample. 


7.1  Tests  based  on  skewness  and  kurtosis 

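Suppose the r-th sample moment $m_r$ about the mean is defined by

$$m_r = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^r. \qquad (21)$$

The sample skewness and sample kurtosis are then defined by

$$b_1 = \frac{m_3^2}{m_2^3}, \qquad b_2 = \frac{m_4}{m_2^2}. \qquad (22)$$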

These statistics are not unbiased estimates of $\beta_1$ and $\beta_2$, but they are consistent,
that is, the bias diminishes with increasing sample size. The sample skewness
and kurtosis are time-honoured statistics for testing normality, having been used
in a rather ad hoc manner for most of this century; $b_1$ is compared with zero,
and $b_2$ with 3, the value of $\beta_2$ for the normal distribution. However, distribution
theory of $b_1$ and $b_2$ is difficult, and it is only since computers have been
available that extensive and reliable tables of significance points have existed for
these statistics. Further, $b_1$ and $b_2$ can be combined to give one overall statistic
(D’Agostino and Pearson, 1973, 1974; D’Agostino, 1986). For other distributions
Bowman and Shenton (1986) have also given tables for these statistics. Studies
have shown that skewness and kurtosis, especially combined, provide good
omnibus tests for normality, although less is known for other distributions. For
the important discrete distribution, the Poisson, all cumulants are equal to the
mean, denoted by the parameter $\lambda$; a time-honoured test for the Poisson is
based on the ratio of sample variance to sample mean, which of course should
be about one. Again, this simple statistic appears to compete well with others
in terms of power.
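The statistics in this subsection are straightforward to compute. The sketch below evaluates the sample central moments with divisor n, then $b_1 = m_3^2/m_2^3$ and $b_2 = m_4/m_2^2$, and finally the variance-to-mean ratio for count data. The function names are illustrative, and the use of the divisor n − 1 in the Poisson ratio is an assumption, since the text does not say which sample variance is meant.

```python
def sample_central_moment(xs, r):
    """r-th sample moment about the mean, with divisor n."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** r for x in xs) / n


def sample_b1_b2(xs):
    """Return (b1, b2): sample skewness and kurtosis statistics."""
    m2 = sample_central_moment(xs, 2)
    m3 = sample_central_moment(xs, 3)
    m4 = sample_central_moment(xs, 4)
    return m3 ** 2 / m2 ** 3, m4 / m2 ** 2


def poisson_dispersion_ratio(counts):
    """Sample variance (divisor n - 1, an assumption) over sample mean."""
    n = len(counts)
    mean = sum(counts) / n
    variance = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return variance / mean
```

For a test of normality, $b_1$ would be compared with 0 and $b_2$ with 3 using appropriate significance tables; for a Poisson test the ratio is compared with 1.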

7.2  A  formal  technique  based  on  moments 

Perhaps because of the variability of sample moments, which makes calculation
of significance points difficult for statistics based on these moments when calculated
from samples of reasonable size, it took some time to formalize a technique
based on moments. Gurland and Dahiya (1970) and Dahiya and Gurland (1972)
have however devised a general procedure. The essential steps are as follows:

1. A vector $\zeta$ of length s, say, must be found, whose components $\zeta_i$ are functions
of the theoretical moments, and such that each component $\zeta_i$ is linear
in the parameters. (This might involve re-parametrising the distribution
from its usual form.)

2. The estimate $h$ of $\zeta$ is obtained by replacing theoretical moments by sample
moments.

3. The test statistic is then based on the difference $h - \zeta$.

Suppose that $\Sigma$ is the covariance matrix of $h$, $\theta$ is the q-vector of unknown
parameters, and W is the s × q matrix such that $\zeta = W\theta$. Then define

$$Q_t = n(h - W\hat{\theta})'\hat{\Sigma}^{-1}(h - W\hat{\theta}),$$

where $\hat{\theta} = (W'\hat{\Sigma}^{-1}W)^{-1}W'\hat{\Sigma}^{-1}h$. The statistic $\hat{\theta}$ is the regression estimate of
$\theta$ obtained by generalized least squares, and $\hat{\Sigma}$ is $\Sigma$ with the estimate $\hat{\theta}$ used
wherever $\theta$ appears.

Gurland and Dahiya (1970, 1972) showed that, asymptotically, the test
statistic $Q_t$ has the $\chi^2_t$ distribution with t = s − q degrees of freedom. Currie and
Stephens (1986, 1990) have studied the procedure, and show several properties
of $Q_t$. Among these is the fact that the test statistic $Q_t$ can be broken into
t components, each with asymptotic distribution $\chi^2_1$, and each testing different

8 Components of other goodness-of-fit statistics

Other  goodness-of-fit  statistics  also  have  components  which  are  functions  of 
moments.  The  oldest  of  these  was  proposed  by  Neyman  (1937),  in  connection 
with  a  test  for  uniformity. 

A test for a fully specified continuous distribution (that is, all parameters
known) can always be converted to a test for uniformity by means of the Probability
Integral Transformation, and a test for the exponential distribution can
also be so converted, even when the scale and origin parameters are not known,
so that Neyman’s test has wider applicability than it might at first appear. (For
details of these transformations, see Stephens, 1986a, 1986b.)

Neyman’s test is as follows: suppose the test is that Z has a uniform distribution
between 0 and 1, written U(0, 1). On the alternative, let the logarithm
of the density of Z be expanded as a series of Legendre polynomials:

$$\log f(z) = A(\mathbf{c})\{1 + c_1 L_1(z) + c_2 L_2(z) + c_3 L_3(z) + \cdots\}, \qquad (23)$$

where the $c_i$ are coefficients, components of the vector $\mathbf{c}$, $L_i(z)$ is the i-th
Legendre polynomial, and $A(\mathbf{c})$ is a normalising constant.

A test for uniformity is then a test that all $c_i = 0$. The estimates of $c_i$ are

$$\hat{c}_i = \sum_{j=1}^{n} L_i(z_j) \qquad (24)$$

where $z_1, z_2, \ldots, z_n$ is the given sample.

The first few Legendre polynomials are best expressed in terms of y = z − 0.5.
Then

$$L_1(z) = 2\sqrt{3}\,y, \qquad (25)$$

$$L_2(z) = \sqrt{5}\,(6y^2 - 0.5), \qquad (26)$$

$$L_3(z) = \sqrt{7}\,(20y^3 - 3y), \qquad (27)$$

so that the estimate $\hat{c}_1$ becomes a function of the first moment about the known
mean 0.5, the second estimate $\hat{c}_2$ becomes a function of the second moment, $\hat{c}_3$
a function of both the third and the first moments, etc.

Neyman shows that the suitably normalised $\hat{c}_i$ have asymptotic N(0, 1) distributions,
and his overall test statistic is the sum of the squares of these normalised
estimates. Thus the overall statistic has an asymptotic $\chi^2$ distribution,
just as for the Dahiya-Gurland statistic, and the individual terms, based on
moments, are the components of the overall test statistic.
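A minimal sketch of the first three components follows. It assumes the normalisation $v_i = n^{-1/2}\sum_j L_i(z_j)$, which gives each component mean 0 and variance 1 under uniformity because the $L_i$ are orthonormal on (0, 1); the function names are illustrative and only the three polynomials listed above are used.

```python
from math import sqrt


def legendre_values(z):
    """Return (L1(z), L2(z), L3(z)) for the polynomials in y = z - 0.5."""
    y = z - 0.5
    return (
        2.0 * sqrt(3.0) * y,
        sqrt(5.0) * (6.0 * y * y - 0.5),
        sqrt(7.0) * (20.0 * y ** 3 - 3.0 * y),
    )


def neyman_components(zs):
    """Normalised components v_i = sum_j L_i(z_j) / sqrt(n), i = 1, 2, 3."""
    n = len(zs)
    totals = [0.0, 0.0, 0.0]
    for z in zs:
        for i, value in enumerate(legendre_values(z)):
            totals[i] += value
    return [t / sqrt(n) for t in totals]


def neyman_statistic(zs):
    """Sum of squares of the first three normalised components."""
    return sum(v * v for v in neyman_components(zs))
```

Under uniformity the statistic computed here would be referred to a chi-square distribution with three degrees of freedom; the individual components can also be inspected to see which kind of departure (location, scale, skewness) is indicated.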

9  EDF  statistics 

Another important family of goodness-of-fit statistics is that derived from the
Empirical Distribution Function (EDF) of the z-sample. This family includes
the well-known Kolmogorov-Smirnov statistic and the Cramér-von Mises family
of statistics (for details and tests for many distributions based on these, see
Stephens, 1986a).

One of the most important of the Cramér-von Mises class is $A^2$, introduced
by Anderson and Darling (1954). The definition of $A^2$ is based on an integral
involving the difference between the EDF and the tested distribution F(x) (with
parameters estimated if necessary). The working formula is

$$A^2 = -n - \frac{1}{n}\sum_{i=1}^{n} (2i - 1)\left[\log z_{(i)} + \log(1 - z_{(n+1-i)})\right], \qquad (28)$$

where $z_i = F(x_i)$, and $z_{(i)}$ are the order statistics.

As  an  omnibus  test  statistic,  A2  has  been  shown  to  perform  well  in  many 
test  situations. 
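The working formula translates directly into code. The sketch below takes the transformed values $z_i = F(x_i)$, which must lie strictly between 0 and 1, and returns $A^2 = -n - (1/n)\sum_i (2i-1)[\log z_{(i)} + \log(1 - z_{(n+1-i)})]$; the function name is illustrative.

```python
from math import log


def anderson_darling(zs):
    """Anderson-Darling A^2 from values z_i = F(x_i), each in (0, 1)."""
    z = sorted(zs)
    n = len(z)
    total = 0.0
    for i in range(1, n + 1):
        # z[i - 1] is z_(i); z[n - i] is z_(n + 1 - i) in 1-based notation.
        total += (2 * i - 1) * (log(z[i - 1]) + log(1.0 - z[n - i]))
    return -n - total / n
```

The resulting value is compared with significance points appropriate to the tested distribution, which differ according to whether parameters were estimated.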

Anderson and Darling showed that the asymptotic distribution of $A^2$ is,
like S of Section 1, a sum of weighted $\chi^2$ variables. The individual terms
in the sum can again be regarded as components of the entire statistic, and
Stephens (1974) has investigated these components in some detail. A remarkable
result is that they too are based on Legendre polynomials, so that they are
effectively the same as the Neyman components, based on moments of the z-sample.
There has been some investigation of components of these and other
statistics, as individual test statistics for the distribution under test; references
are given by Stephens (1986a). As for the Gurland-Dahiya components, they can
be expected to be sensitive to different departures from the tested distribution.
The complete test statistics of Neyman and of Anderson-Darling combine the
same components, but with different weightings.